import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
iris_df = iris.frame
y = iris.target_names[iris.target] == 'virginica'
iris_df['target'] = y
#iris_df['target'] = y.astype(int)
# target == 'non-virginica'
iris_df[iris_df['target'] == 0].describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 5.471000 | 3.099000 | 2.861000 | 0.786000 |
| std | 0.641698 | 0.478739 | 1.449549 | 0.565153 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.000000 | 2.800000 | 1.500000 | 0.200000 |
| 50% | 5.400000 | 3.050000 | 2.450000 | 0.800000 |
| 75% | 5.900000 | 3.400000 | 4.325000 | 1.300000 |
| max | 7.000000 | 4.400000 | 5.100000 | 1.800000 |
# target == 'virginica'
iris_df[iris_df['target'] == 1].describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 50.00000 | 50.000000 | 50.000000 | 50.00000 |
| mean | 6.58800 | 2.974000 | 5.552000 | 2.02600 |
| std | 0.63588 | 0.322497 | 0.551895 | 0.27465 |
| min | 4.90000 | 2.200000 | 4.500000 | 1.40000 |
| 25% | 6.22500 | 2.800000 | 5.100000 | 1.80000 |
| 50% | 6.50000 | 3.000000 | 5.550000 | 2.00000 |
| 75% | 6.90000 | 3.175000 | 5.875000 | 2.30000 |
| max | 7.90000 | 3.800000 | 6.900000 | 2.50000 |
This dataset supports classifying iris species from flower measurements. The independent variables are sepal length, sepal width, petal length, and petal width. The model will determine whether an iris is of the virginica species based on these variables: a target value of False (0) means non-virginica, and True (1) means virginica. The tables above summarize the features of the flowers for each target class.
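As a quick sanity check, the class balance implied by the count rows above (100 non-virginica vs. 50 virginica) can be confirmed directly. A minimal sketch that rebuilds the same DataFrame:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_df = iris.frame
# Binary target: True for virginica, False for the other two species
iris_df["target"] = iris.target_names[iris.target] == "virginica"

# value_counts() confirms the 2:1 class imbalance seen in the describe() tables
counts = iris_df["target"].value_counts()
print(counts)  # False: 100, True: 50
```

Because the classes are imbalanced 2:1, accuracy alone can mislead: a model that always predicts non-virginica would already score about 0.67.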
import seaborn as sns
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 2, figsize=(7, 7))
sns.histplot(data=iris_df, x="sepal length (cm)", hue="target", multiple="dodge", ax = axs[0, 0])
sns.histplot(data=iris_df, x="sepal width (cm)", hue="target", multiple="dodge", ax = axs[0, 1])
sns.histplot(data=iris_df, x="petal length (cm)", hue="target", multiple="dodge", ax = axs[1, 0])
sns.histplot(data=iris_df, x="petal width (cm)", hue="target", multiple="dodge", ax = axs[1, 1])
plt.show()
The histograms show the frequency of each independent variable's values for each target class, giving a first look at how each variable separates virginica from non-virginica. For non-virginica, the distributions of sepal length, petal length, and petal width are concentrated at shorter values, while virginica tends toward longer values. The distribution of sepal width is similar regardless of the target, suggesting it may not have a strong relationship with the classification.
f, ax = plt.subplots(figsize=(6, 4))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(iris_df.corr(), annot=True, cmap=cmap)
Examining the correlation matrix shows how strongly each variable correlates with the target. Ordered by correlation with the target, the features are petal width (0.77), petal length (0.72), sepal length (0.64), and sepal width (-0.14).
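These values can also be read programmatically by sorting the target column of the matrix. A small sketch (rebuilding the same DataFrame, with the boolean target cast to int so corr() treats it numerically):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_df = iris.frame
iris_df["target"] = (iris.target_names[iris.target] == "virginica").astype(int)

# Pearson correlation of each feature with the binary target,
# sorted by absolute strength so the weak negative sepal-width value ranks last
corr_with_target = (iris_df.corr()["target"]
                    .drop("target")
                    .sort_values(key=abs, ascending=False))
print(corr_with_target.round(2))
```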
sns.pairplot(iris_df, hue="target", height=3, diag_kind="kde")
This pairplot shows the relationships between each independent variable and the distribution of the target values based on those variables. Since we need to create a model that distinguishes between virginica and non-virginica, we must select variables for the model design where the target values are clearly distinguishable. Overall, it can be observed that plots involving sepal width values do not clearly separate the target. In contrast, plots for the other variables generally allow for a clear distinction between the target values.
iris_df.boxplot(by="target", figsize=(12, 6))
The boxplots show the distribution of each variable's values for each target class. For virginica, petal length is concentrated at longer values, while for non-virginica it is the opposite. Petal width and sepal length also tend to be higher for virginica, though the differences are less pronounced than for petal length. For sepal width, the distributions are nearly identical for both targets. It therefore appears advisable to build the model around petal length as an independent variable.
# A final multivariate visualization technique pandas has is radviz
# Which puts each feature as a point on a 2D plane, and then simulates
# having each sample attached to those points through a spring weighted
# by the relative value for that feature
from pandas.plotting import radviz
radviz(iris_df, "target")
RadViz enables the projection of an N-dimensional dataset into a 2D space, where the impact of each dimension is seen as a balance among the influences of all dimensions. Data that show high correlation are positioned nearer to each other on the unit circle. Most virginica and non-virginica cases cluster together in the middle, but it can be observed that certain non-virginica cases are significantly influenced by sepal width. Therefore, it can be concluded that using sepal width as an indicator to represent the general characteristics of all non-virginica cases is challenging.
from sklearn.model_selection import train_test_split
feature_cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(iris_df[feature_cols], y, test_size=0.2)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)
from sklearn.linear_model import LogisticRegression
log_reg_1 = LogisticRegression()
log_reg_2 = LogisticRegression()
log_reg_3 = LogisticRegression()
log_reg_4 = LogisticRegression()
log_reg_1.fit(X_train[['petal width (cm)']], y_train)
log_reg_2.fit(X_train[['petal width (cm)', 'petal length (cm)']], y_train)
log_reg_3.fit(X_train[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']], y_train)
log_reg_4.fit(X_train[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']], y_train)
Based on the analysis results, the model was constructed by adding features in the order of highest correlation: petal width, petal length, and then sepal length. The feature that presents difficulty in classifying classes, sepal width, was added last.
To evaluate each model and observe the actual accuracy, the models with 1 to 4 features were run on the validation set. The measure used to compare them is the score() function of LogisticRegression, which returns the mean accuracy: the fraction of predictions on the given set that match the actual labels.
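score() is equivalent to comparing predict() output against the labels and averaging. A self-contained sketch (with its own random split, so the accuracy value will differ from the tables below):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.frame[["petal width (cm)"]]
y = iris.target_names[iris.target] == "virginica"
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

# Mean accuracy by hand: fraction of predictions that equal the labels
manual_accuracy = np.mean(model.predict(X_va) == y_va)
print(manual_accuracy == model.score(X_va, y_va))  # True
```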
1-feature model
# predict class
pred = log_reg_1.predict(X_val[['petal width (cm)']])
# probability estimates
prob = log_reg_1.predict_proba(X_val[['petal width (cm)']])
# mean accuracy score
print("Mean accuracy of the model: ", log_reg_1.score(X_val[['petal width (cm)']], y_val))
# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3),
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model: 1.0
| | probability of predicting virginica | actual prediction | ground truth |
|---|---|---|---|
| 0 | 0.003 | False | False |
| 1 | 0.006 | False | False |
| 2 | 0.334 | False | False |
| 3 | 0.334 | False | False |
| 4 | 0.846 | True | True |
| 5 | 0.624 | True | True |
| 6 | 0.624 | True | True |
| 7 | 0.948 | True | True |
| 8 | 0.786 | True | True |
| 9 | 0.786 | True | True |
| 10 | 0.334 | False | False |
| 11 | 0.003 | False | False |
| 12 | 0.004 | False | False |
| 13 | 0.003 | False | False |
| 14 | 0.924 | True | True |
2-feature model
# predict class
pred = log_reg_2.predict(X_val[['petal width (cm)', 'petal length (cm)']])
# probability estimates
prob = log_reg_2.predict_proba(X_val[['petal width (cm)', 'petal length (cm)']])
# mean accuracy score
print("Mean accuracy of the model: ", log_reg_2.score(X_val[['petal width (cm)', 'petal length (cm)']], y_val))
# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3),
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model: 0.9333333333333333
| | probability of predicting virginica | actual prediction | ground truth |
|---|---|---|---|
| 0 | 0.000 | False | False |
| 1 | 0.000 | False | False |
| 2 | 0.104 | False | False |
| 3 | 0.198 | False | False |
| 4 | 0.968 | True | True |
| 5 | 0.499 | False | True |
| 6 | 0.679 | True | True |
| 7 | 0.964 | True | True |
| 8 | 0.994 | True | True |
| 9 | 0.764 | True | True |
| 10 | 0.241 | False | False |
| 11 | 0.000 | False | False |
| 12 | 0.000 | False | False |
| 13 | 0.000 | False | False |
| 14 | 0.979 | True | True |
3-feature model
# predict class
pred = log_reg_3.predict(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']])
# probability estimates
prob = log_reg_3.predict_proba(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']])
# mean accuracy score
print("Mean accuracy of the model: ", log_reg_3.score(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)']], y_val))
# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3),
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model: 0.9333333333333333
| | probability of predicting virginica | actual prediction | ground truth |
|---|---|---|---|
| 0 | 0.000 | False | False |
| 1 | 0.000 | False | False |
| 2 | 0.106 | False | False |
| 3 | 0.226 | False | False |
| 4 | 0.961 | True | True |
| 5 | 0.496 | False | True |
| 6 | 0.708 | True | True |
| 7 | 0.961 | True | True |
| 8 | 0.993 | True | True |
| 9 | 0.749 | True | True |
| 10 | 0.216 | False | False |
| 11 | 0.000 | False | False |
| 12 | 0.000 | False | False |
| 13 | 0.000 | False | False |
| 14 | 0.977 | True | True |
4-feature model
# predict class
pred = log_reg_4.predict(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']])
# probability estimates
prob = log_reg_4.predict_proba(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']])
# mean accuracy score
print("Mean accuracy of the model: ", log_reg_4.score(X_val[['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']], y_val))
# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3),
              'actual prediction': pred, 'ground truth': y_val})
Mean accuracy of the model: 1.0
| | probability of predicting virginica | actual prediction | ground truth |
|---|---|---|---|
| 0 | 0.000 | False | False |
| 1 | 0.000 | False | False |
| 2 | 0.099 | False | False |
| 3 | 0.211 | False | False |
| 4 | 0.962 | True | True |
| 5 | 0.501 | True | True |
| 6 | 0.696 | True | True |
| 7 | 0.959 | True | True |
| 8 | 0.993 | True | True |
| 9 | 0.732 | True | True |
| 10 | 0.220 | False | False |
| 11 | 0.000 | False | False |
| 12 | 0.000 | False | False |
| 13 | 0.000 | False | False |
| 14 | 0.975 | True | True |
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as po
po.init_notebook_mode()
X_val['target'] = y_val
1-feature
fig = px.scatter(X_val, x='petal width (cm)', y = 'target')
decision_boundary = -log_reg_1.intercept_ / log_reg_1.coef_
fig.add_trace(go.Scatter(x=[decision_boundary[0][0], decision_boundary[0][0]], y=[False, True], mode="lines", name="Decision Boundary"))
fig.show()
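The vertical line drawn above is where the fitted sigmoid crosses 0.5: with one feature, sigma(w x + b) = 0.5 exactly when w x + b = 0, i.e. at x = -b/w. A self-contained check (refitting on the full dataset, so the boundary value differs from the notebook's model):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X = iris.frame[["petal width (cm)"]]
y = iris.target_names[iris.target] == "virginica"
model = LogisticRegression().fit(X, y)

# Solve w*x + b = 0 for x: the point where the predicted probability is 0.5
boundary = -model.intercept_[0] / model.coef_[0][0]
prob_at_boundary = model.predict_proba([[boundary]])[0, 1]
print(round(float(prob_at_boundary), 6))  # ~0.5
```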
2-feature
import numpy as np
fig = px.scatter(X_val, x='petal width (cm)', y='petal length (cm)', color='target')
decision_boundary_x1 = np.linspace(iris_df[['petal width (cm)']].min(), iris_df[['petal width (cm)']].max(), 10)
decision_boundary_x2 = -log_reg_2.intercept_ / log_reg_2.coef_[0][1] - log_reg_2.coef_[0][0] / log_reg_2.coef_[0][1] * decision_boundary_x1
fig.add_trace(go.Scatter(x=[decision_boundary_x1[0][0], decision_boundary_x1[9][0]], y=[decision_boundary_x2[0][0], decision_boundary_x2[9][0]],
mode="lines", name="Decision Boundary"))
fig.show()
3-feature
fig = px.scatter_3d(X_val, x='petal width (cm)', y='petal length (cm)', z='sepal length (cm)', color='target')
x1, x2 = np.meshgrid(np.linspace(iris_df[['petal width (cm)']].min(), iris_df[['petal width (cm)']].max(), 10),
np.linspace(iris_df[['petal length (cm)']].min(), iris_df[['petal length (cm)']].max(), 10))
decision_boundary_x3 = -log_reg_3.intercept_ / log_reg_3.coef_[0][2] - log_reg_3.coef_[0][0] / log_reg_3.coef_[0][2] * x1 - log_reg_3.coef_[0][1] / log_reg_3.coef_[0][2] * x2
fig.add_trace(go.Surface(x=x1, y=x2, z=decision_boundary_x3, showscale=False, showlegend=True, opacity=0.8, name="Decision Boundary"))
fig.show()
When the models were evaluated on the validation set, the 1- and 4-feature models did not misclassify any instances, while the 2- and 3-feature models each made one incorrect prediction out of 15. In both cases the error was the same observation, the 6th instance (row index 5) of the validation set. Its predicted virginica probabilities were 0.499 for the 2-feature model and 0.496 for the 3-feature model; both are nearly 0.5, meaning the point lies close to the decision boundary. On the plots, this point sits essentially on the boundary: its true label is virginica (True), but both models predicted False.
print(X_val.iloc[5])
sepal length (cm)     5.9
sepal width (cm)      3.0
petal length (cm)     5.1
petal width (cm)      1.8
target               True
Name: 149, dtype: object
In the 2-feature model, this instance has petal length 5.1 cm and petal width 1.8 cm; comparing those values against the histograms shows they occur in both classes, so the model can plausibly misclassify it. In the 3-feature model, adding sepal length does not make the flower any easier to classify. This suggests that adding more features does not always improve the model's predictions.
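The near-0.5 probability can be traced back to the linear score directly: predict_proba for the positive class is the sigmoid of w·x + b. A self-contained check (refit on the full dataset, so the numbers differ from the tables above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X = iris.frame[["petal width (cm)", "petal length (cm)"]]
y = iris.target_names[iris.target] == "virginica"
model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba(x)[1] equals sigmoid(w . x + b), row by row
z = X.to_numpy() @ model.coef_[0] + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(manual_prob, model.predict_proba(X)[:, 1]))  # True
```

An instance whose z is near 0 lands near probability 0.5, which is exactly the borderline behavior seen for row 5 above.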
# predict class
pred = log_reg_1.predict(X_test[['petal width (cm)']])
# probability estimates
prob = log_reg_1.predict_proba(X_test[['petal width (cm)']])
# mean accuracy score
print("Mean accuracy of the model: ", log_reg_1.score(X_test[['petal width (cm)']], y_test))
# print table
pd.DataFrame({'probability of predicting virginica': prob[:, 1].round(3),
              'actual prediction': pred, 'ground truth': y_test})
Mean accuracy of the model: 0.9333333333333333
| | probability of predicting virginica | actual prediction | ground truth |
|---|---|---|---|
| 0 | 0.334 | False | False |
| 1 | 0.334 | False | False |
| 2 | 0.184 | False | False |
| 3 | 0.003 | False | False |
| 4 | 0.334 | False | False |
| 5 | 0.252 | False | False |
| 6 | 0.924 | True | True |
| 7 | 0.786 | True | True |
| 8 | 0.003 | False | False |
| 9 | 0.624 | True | True |
| 10 | 0.786 | True | True |
| 11 | 0.252 | False | True |
| 12 | 0.004 | False | False |
| 13 | 0.003 | False | False |
| 14 | 0.002 | False | False |
When using the score() function, which represents the accuracy of the model's predictions, the evaluation on the validation set showed that the 1- and 4-feature models scored 1.00, while the remaining models scored 0.93. This indicates that the 1- and 4-feature models are the most accurate. However, since the 4-feature model adds sepal width, a feature with low correlation to the target, and carries a greater risk of overfitting, the 1-feature model can be regarded as the best. Nevertheless, the differences in accuracy among the models may not be significant, and the models may simply happen to perform well on this particular split. Moreover, with only 120 training instances and just 15 instances each for validation and test, these results are hard to treat as definitive; a much larger dataset would be required to evaluate reliably which model is superior.
Indeed, when the 1-feature model was evaluated on the test set, it scored 0.93, correctly predicting 14 of the 15 instances.
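One way to address the small-split concern without more data is k-fold cross-validation over the full dataset, so every instance is used for both training and evaluation. A sketch (not part of the original experiment, and the accuracies will not match the single-split scores above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris(as_frame=True)
y = iris.target_names[iris.target] == "virginica"

feature_sets = [
    ["petal width (cm)"],
    ["petal width (cm)", "petal length (cm)"],
    ["petal width (cm)", "petal length (cm)", "sepal length (cm)"],
    ["petal width (cm)", "petal length (cm)", "sepal length (cm)", "sepal width (cm)"],
]
for features in feature_sets:
    # Stratified 5-fold CV: each fold keeps the 2:1 class ratio
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             iris.frame[features], y, cv=5)
    print(len(features), "feature(s): mean accuracy", round(scores.mean(), 3))
```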